Integrating acoustic and state-transition models for free phone recognition in L2 English speech using multi-distribution deep neural networks
Authors
Abstract
This paper investigates the use of Multi-Distribution Deep Neural Networks (MD-DNNs) for integrating acoustic and state-transition models in free phone recognition of L2 English speech. In a Computer-Aided Pronunciation Training (CAPT) system, free phone recognition for L2 English speech is the key model for Mispronunciation Detection and Diagnosis (MDD) when learners are allowed to speak freely. A simple Automatic Speech Recognition (ASR) system can be built from an Acoustic Model (AM) and a State-Transition Model (STM). Generally, these two models are trained independently, so contextual information may be lost. Inspired by the Acoustic-Phonological Model, which achieves substantial improvements by integrating the AM and a Phonological Model (PM) in MDD for the case where L2 learners practice their English by reading prompts, we propose a joint Acoustic-State-Transition Model (ASTM) that uses an MD-DNN to integrate the AM and STM. Preliminary experiments with basic parameter configurations show that the ASTM obtains a phone accuracy of about 68% on the TIMIT data, compared with only about 52% for a system using a separate AM and STM. Further fine-tuning of the ASTM raises the accuracy to about 72% on TIMIT. Similar performance is obtained when the ASTM is trained and tested on our L2 English speech corpus (CU-CHLOE).
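To make the integration idea concrete, the sketch below shows one way a single network can consume both continuous acoustic features and a discrete previous-phone-state vector, so that the posterior over current phone states is modeled jointly rather than by a separately trained AM and STM. This is only an illustrative sketch in Python (PyTorch), not the paper's actual MD-DNN: the plain feed-forward architecture, the feature dimension, the number of states, and all names (JointAcousticStateTransitionModel, acoustic_dim, num_states) are assumptions made here for illustration.

# Hypothetical sketch: one network jointly models acoustics and state transitions.
import torch
import torch.nn as nn

class JointAcousticStateTransitionModel(nn.Module):
    """Predicts the current phone-state posterior from acoustic features
    together with a one-hot encoding of the previous phone state."""

    def __init__(self, acoustic_dim=39, num_states=144, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + num_states, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_states),  # logits over current phone states
        )

    def forward(self, acoustic_frames, prev_state_onehot):
        # Concatenating the two input "distributions" (continuous acoustics,
        # discrete previous state) lets one network estimate
        # P(state_t | acoustics_t, state_{t-1}) jointly.
        x = torch.cat([acoustic_frames, prev_state_onehot], dim=-1)
        return self.net(x)

# Toy usage with hypothetical sizes: 8 frames, 39-dim features, 144 phone states.
model = JointAcousticStateTransitionModel()
feats = torch.randn(8, 39)
prev = torch.zeros(8, 144)
prev[:, 3] = 1.0  # pretend every frame follows state index 3
posteriors = torch.softmax(model(feats, prev), dim=-1)
print(posteriors.shape)  # torch.Size([8, 144])

An actual MD-DNN would presumably handle the heterogeneous inputs with distribution-specific input units rather than plain concatenation; the sketch is only meant to convey why joint modeling can preserve the contextual coupling between acoustics and state transitions that separately trained models lose.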
Similar papers
Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods
Speech recognition is a subfield of artificial intelligence that develops technologies to convert speech utterances into transcriptions. So far, various methods such as hidden Markov models and artificial neural networks have been used to develop speech recognition systems. In most of these systems, the speech signal frames are processed uniformly, while the information is not evenly distributed ...
Convolutional Neural Network with Adaptable Windows for Speech Recognition
Although speech recognition systems are widely used and their accuracy is continuously improving, there is a considerable performance gap between their accuracy and human recognition ability. This is partially due to high speaker variability in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
Integrating Deep Neural Networks into Structured Classification Approach based on Weighted Finite-State Transducers
Recently, deep neural networks (DNNs) have been drawing the attention of speech researchers because of their capability for handling nonlinearity in speech feature vectors. On the other hand, speech recognition based on structured classification is also considered important since it realizes the direct classification of automatic speech recognition. For example, a structured classification meth...
A Comparative Study of Gender and Age Classification in Speech Signals
Accurate gender classification is useful in speech and speaker recognition as well as speech emotion classification, because a better performance has been reported when separate acoustic models are employed for males and females. Gender classification is also apparent in face recognition, video summarization, human-robot interaction, etc. Although gender classification is rather mature in a...